Dimensionality Reduction via Principal Components Analysis (PCA)

To understand PCA, check out the slides that accompany this code sample at bitly.com/adi-pca-slides.

Example 1: Point cloud

Setup

We'll be using PCA from scikit-learn, a commonly used machine learning library in Python.

You will need to install some dependencies first if you haven't already. On Linux or Mac, you can run

sudo easy_install pip

to get a Python package manager called pip. Then run

pip install -U numpy
pip install -U scipy
pip install -U scikit-learn

to get scikit-learn.


In [1]:
from sklearn.decomposition import PCA
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # 3D Plotting
from scipy import stats
import seaborn as sns                     # prettier styling for plots
import ipywidgets                         # interactions

%matplotlib inline

Create data

Now we will create a deliberately pathological dataset to demonstrate what PCA does. Let's create a point cloud that is very flat in one direction; that way we can show that PCA discards the flat direction when reducing dimensionality, since it contributes little information.


In [2]:
np.random.seed(4) # Reproducible results

mean = np.zeros(2)
cov = [[1, 0.9],
       [0.9, 1]]

data = np.random.multivariate_normal(mean, cov, size=500)
xs = data[:, 0]
ys = data[:, 1]

plt.scatter(xs, ys, marker=".", color="blue")


Out[2]:
<matplotlib.collections.PathCollection at 0x7f5dba7202e8>

As you can see, most of the variation is in just one direction. We can formalize this intuition by using PCA to draw out the appropriate axes:


In [3]:
pca = PCA(n_components=2)
pca.fit(data)

axis1 = pca.components_[0]     # axis of most variation
axis2 = pca.components_[1]     # axis of second-most variation

plt.scatter(xs, ys, marker='.', color="blue")

plt.plot(axis1[0] * np.arange(-4, 5), axis1[1] * np.arange(-4, 5), linewidth=4,
         color="red")
plt.plot(axis2[0] * np.arange(-1, 2), axis2[1] * np.arange(-1, 2), linewidth=2, 
         color="red")


Out[3]:
[<matplotlib.lines.Line2D at 0x7f5dba652f98>]

The two red lines define our new, more informative axes. As you can see, most of the information is in how far along the thicker red line each point lies.
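
We can make this precise with explained_variance_ratio_, which reports the fraction of the total variance captured by each axis. A quick sketch using the estimator we just fit (exact numbers depend on the random seed above):

# Fraction of total variance along each principal axis. With the strongly
# correlated covariance above, the first axis should capture roughly 95%.
print(pca.explained_variance_ratio_)

# Keeping only the first axis turns each 2-D point into a single coordinate:
reduced = PCA(n_components=1).fit_transform(data)
print(reduced.shape)   # (500, 1)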

PCA isn't limited to two dimensions; it works just as well for higher-dimensional data. Here's a visualization in three dimensions:


In [4]:
from ipywidgets import interact
mean = np.zeros(3)
cov = [[1, 0, 0],
       [0, 1, 0.9],
       [0, 0.9, 1]]

data = np.random.multivariate_normal(mean, cov, 1000)
xs = data[:, 0]
ys = data[:, 1]
zs = data[:, 2]

figure1 = plt.figure()
ax1 = Axes3D(figure1)
ax1.scatter(xs, ys, zs, marker='.')
plt.close(figure1)  # prevent double-display with interact

# You must be running the Jupyter notebook for interactions to work.
# It will just be a static image when viewed on GitHub or Nbviewer

@interact(elev=(0, 180), azim=(0, 180))
def plot_point_cloud(elev, azim):
    ax1.view_init(elev=elev, azim=azim)
    return figure1



In [5]:
pca = PCA(n_components=2)
pca.fit(data)

axis1 = pca.components_[0]
axis2 = pca.components_[1]
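
Visualizing the axes is nice, but the actual reduction is done with transform, which drops the flat direction and gives each point 2-D coordinates. A minimal sketch using the estimator we just fit:

# Project the 3-D points onto the two principal axes, giving a 2-D dataset.
reduced = pca.transform(data)
print(reduced.shape)                        # (1000, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1: little information lost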

In [6]:
a, b, c = np.cross(axis1, axis2)

# By definition of cross product, <a, b, c> is orthogonal to the plane
# spanned by axis1 and axis2 through (0, 0, 0). The plane's equation is thus:
#      ax + by + cz = 0
# or   z = -(ax + by) / c
xx, yy = np.meshgrid(np.arange(-4, 4), np.arange(-4, 4))
zz = -(a * xx + b * yy) / c

figure2 = plt.figure()
ax2 = Axes3D(figure2)
ax2.plot_surface(xx, yy, zz, color="red", alpha=0.5)
ax2.scatter(xs, ys, zs, marker='.')
plt.close(figure2)

@interact(elev=(0, 180), azim=(0, 180))
def plot_point_cloud(elev, azim):
    ax2.view_init(elev=elev, azim=azim)
    return figure2
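
The red plane is exactly the subspace the points are projected onto. As a sanity check, we can project the points onto it and back and measure how far each one moves; a small average displacement confirms that the discarded direction carried little information. A rough sketch:

# Project onto the 2-D plane and map back into 3-D, then measure the displacement.
reconstructed = pca.inverse_transform(pca.transform(data))
errors = np.linalg.norm(data - reconstructed, axis=1)
print("mean reconstruction error:", errors.mean())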


Example 2: Eigenfaces

Summary

We'll be using an old dataset of creepy grayscale faces from AT&T called the Olivetti Faces. Images of human faces are complex, so if we want to do facial recognition we need to simplify them first. PCA is a great way to capture the distinctive features of a person's face that can help us classify them. In this example, we'll extract and plot just the principal components of the faces, the "eigenfaces". Be warned, the results look a bit spooky.


In [7]:
from numpy.random import RandomState
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn import decomposition

# Load faces data
faces = fetch_olivetti_faces(shuffle=True, random_state=RandomState(1)).data
n_samples, n_features = faces.shape

# Center the faces
faces_centered = faces - faces.mean(axis=0)
faces_centered -= faces_centered.mean(axis=1) \
                                .reshape(n_samples, -1)

print("Dataset consists of %d faces" % n_samples)

def plot_gallery(title, images, n_col=3, n_row=2):
    """
    Helper function to plot images.
    """
    image_shape = (64, 64)

    plt.figure(figsize=(2. * n_col, 2.26 * n_row))
    plt.suptitle(title, size=16)
    
    for i, comp in enumerate(images):
        plt.subplot(n_row, n_col, i + 1)
        vmax = max(comp.max(), -comp.min())
        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                   interpolation='nearest',
                   vmin=-vmax, vmax=vmax)
        plt.xticks([])
        plt.yticks([])
    plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
    
n_components = 6

# Plot a sample of the input data
plot_gallery("Input", faces_centered[:n_components])

# Apply PCA and plot results
print("Extracting the top %d components" % (n_components))
data = faces_centered
# For efficiency we use PCA's randomized SVD solver, which approximates the
# top components with a stochastic algorithm instead of computing a full SVD.
estimator = decomposition.PCA(n_components=n_components, whiten=True,
                              svd_solver='randomized')
estimator.fit(data)
plot_gallery('PCA', estimator.components_[:n_components])

plt.show()


Dataset consists of 400 faces
Extracting the top 6 components
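
For recognition, the eigenfaces are only the basis; each face is then represented by its coordinates in that basis. A minimal sketch of how the fitted estimator could produce those compact feature vectors and an approximate reconstruction:

# Each centered face (64 x 64 = 4096 pixels) becomes a 6-dimensional code.
face_codes = estimator.transform(faces_centered)
print(face_codes.shape)   # (400, 6)

# An approximate face can be rebuilt from those 6 numbers alone.
approximations = estimator.inverse_transform(face_codes)
plot_gallery("Reconstructed from 6 components", approximations[:6])
plt.show()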
